Add schema for Airline Reporting Carrier On-Time Performance Dataset #2
Conversation
Could you resolve the conflicts? Seems like this is now unblocked.
This dataset is 81GB. Should we hold off on adding it until we add code that properly handles datasets that can't be fully loaded into memory? We probably want to implement download resuming and maybe use some dependency like Dask for loading in large Pandas dataframes.
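To make concrete what I mean, here's a rough sketch of the kind of out-of-core loading Dask enables (the glob path and column name below are placeholders, not the real layout of this dataset):

```python
# Rough sketch only: lazy, out-of-core loading with Dask instead of pandas.read_csv.
# "airline_ontime/*.csv" and "Reporting_Airline" are placeholder names for illustration.
import dask.dataframe as dd

# read_csv builds a lazy, partitioned dataframe; nothing is read into memory yet.
df = dd.read_csv("airline_ontime/*.csv", dtype=str, blocksize="256MB")

# Computations run partition by partition, so peak memory stays near one block
# rather than the full 81GB dataset.
counts = df["Reporting_Airline"].value_counts()
print(counts.compute().head())
```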
@edwardleardi Yes, I agree. How about we make this PR independent from the monolithic issue?
The problem I had with this one was that I had trouble running the tests locally. I can try running them again on a machine with more storage.
@djalova yeah, the tests in this repo actually download and load every dataset that is part of the dataset schema. I don't think we'll ever get this to pass without properly handling large datasets like this one. The way things stand currently, we would need a machine with 81GB of memory to load the dataset when running the test. Plus it would need to download the whole dataset, so I don't think we want to be doing that for every test run.
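If we do keep a dataset this size in the schema, the test suite would probably need some kind of size gate. Something like this, where all the names and sizes are hypothetical stand-ins rather than the repo's actual schema registry:

```python
# Hypothetical sketch of size-gating dataset tests in pytest.
# SCHEMATA and the size numbers are invented stand-ins for the real schema registry.
import pytest

MAX_TEST_DATASET_BYTES = 1 * 1024**3  # skip anything over ~1GB in CI

SCHEMATA = {
    "tiny_example": 10_000_000,      # stand-in: declared size in bytes
    "airline_ontime": 81 * 1024**3,  # stand-in for this 81GB dataset
}

@pytest.mark.parametrize("name", sorted(SCHEMATA))
def test_download_and_load(name):
    if SCHEMATA[name] > MAX_TEST_DATASET_BYTES:
        pytest.skip(f"{name} exceeds the CI download/memory budget")
    # The real test would download and load the dataset here.
    assert SCHEMATA[name] > 0
```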
@xuhdev Yeah, we should definitely make it independent. How do you think we should approach implementing features for loading large datasets in the future? It would probably be on a loader-dependent basis, right? Since loading a large dataset depends on the dataset type and the Python object you want to load it into. What do you think about creating a new epic for loaders? We could add issues for an initial loader implementation like how we did with
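Just to illustrate what I mean by "loader-dependent" (this is purely hypothetical, none of these classes exist in the codebase today): each format/target-object pair could get its own loader, and only the loaders that can stream would advertise chunked loading.

```python
# Purely illustrative sketch of a per-format loader abstraction; the class and
# method names are invented and not part of the current codebase.
from abc import ABC, abstractmethod
from typing import Any, Iterator

import pandas as pd


class Loader(ABC):
    """Loads one dataset format into one kind of Python object."""

    @abstractmethod
    def load(self, path: str) -> Any:
        """Eagerly load the whole dataset (fine for small datasets)."""

    @abstractmethod
    def iter_chunks(self, path: str, chunk_size: int) -> Iterator[Any]:
        """Stream the dataset in chunks so it never fully resides in memory."""


class CSVPandasLoader(Loader):
    """Loads CSV files into pandas DataFrames."""

    def load(self, path: str) -> pd.DataFrame:
        return pd.read_csv(path)

    def iter_chunks(self, path: str, chunk_size: int = 100_000) -> Iterator[pd.DataFrame]:
        # pandas yields DataFrames of chunk_size rows when chunksize is given.
        yield from pd.read_csv(path, chunksize=chunk_size)
```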
It seems like the need for beefier CI/CD infrastructure keeps coming up. Let's make a note to discuss this when we get back from the holidays next year.
@edwardleardi I don't see this issue as very urgent -- most people who use this dataset would have a large amount of RAM available; otherwise the dataset may not be very useful anyway, depending on the use case (in cases where even a subset of it can't be loaded). If we truly want some disk-backed loading, we may add a

About the epic, sure: after the first release we would definitely split different kinds of tasks into different epics if that would make management easier. Currently we only make a pre-release/post-release distinction, perhaps because we aren't focusing on what happens after the first release yet.
Large dataset support in general would be interesting, and should the demand arise we should do something to keep up with loading large datasets; it's just unclear to me yet what we should do about it. Perhaps we can open an issue and revisit later?
Got it, good idea
Opened here: https://app.zenhub.com/workspaces/pydax-5fcfdd73254483001e3f3b55/issues/codait/pydax/100